feat(eval): episode sharding, parallel launcher, and autotune by pkooij · Pull Request #3275 · huggingface/lerobot

pkooij · 2026-04-03T15:32:21Z

Title

feat(eval): episode sharding, parallel launcher, and autotune

Type / Scope

Type: Performance / Feature
Scope: lerobot/scripts/lerobot_eval.py, lerobot/configs/default.py, new lerobot_eval_parallel.py, new lerobot_eval_autotune.py

Summary / Motivation

Even after PR #3274 fixes AsyncVectorEnv, a single eval process achieves only ~20% GPU utilisation (env step ~20ms >> inference ~5ms). The remaining idle time can be recovered by running multiple independent eval processes (shards), each handling a disjoint slice of episodes and its own model copy. On an H100 (80 GB), SmolVLA at fp16 (~14 GB) fits 4–5 times → 4 × 20% ≈ 80–100% GPU utilisation with zero networking or coordination overhead.

This PR adds:

Episode sharding in lerobot_eval.py: each process handles episodes shard_id, shard_id+N, ... with non-overlapping seeds.
lerobot-eval-parallel: spawns K subprocesses, sets MUJOCO_GL and OMP_NUM_THREADS, merges results.
lerobot-eval-autotune: probes GPU VRAM, CPU cores, model footprint, and env step time; outputs optimal num_shards / batch_size / MUJOCO_GL with a paste-ready command.

Related issues

Requires: feat(envs): lazy env init + AsyncVectorEnv as default for n_envs > 1 #3274 (lazy env init + AsyncVectorEnv)
Related: refactor(envs): move benchmark dispatch into EnvConfig subclasses #3272

What changed

configs/default.py (EvalConfig): add shard_id: int = 0, num_shards: int = 1; validate ranges in __post_init__
lerobot_eval.py: add _shard_episodes(n_episodes, shard_id, num_shards) → list[int]; eval_main computes per-shard episode count and seed offset; writes shard_K_of_N.json when num_shards > 1, else eval_info.json (default unchanged)
lerobot_eval_parallel.py (new, ~120 LOC): parse --num-shards / --render-device; spawn K subprocesses; wait; merge shard JSON files into eval_info.json
lerobot_eval_autotune.py (new, ~140 LOC): 8-step hardware probe → AutotuneRecommendation; main() prints summary + paste-ready command
pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune entry points

Default behaviour is unchanged: num_shards=1 → exactly the same execution path as before.

How was this tested (or how to run locally)

Tests added:

test_shard_assignment: _shard_episodes(100, 2, 5) == [2, 7, 12, ..., 97]
test_shard_uneven: 103 episodes / 5 shards distributes without overlap or gap
test_shard_no_overlap: union of all shards == full episode range

Single-machine parallel run:

# Auto-detect optimal config
lerobot-eval-autotune policy.path=lerobot/smolvla_libero env.type=libero

# Run with 4 shards
lerobot-eval-parallel --num-shards 4 \
  policy.path=lerobot/smolvla_libero \
  env.type=libero \
  eval.n_episodes=200 \
  eval.batch_size=20 \
  output_dir=outputs/eval/parallel_run

# Let autotune decide
lerobot-eval-parallel --num-shards auto \
  policy.path=lerobot/smolvla_libero \
  env.type=libero

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest)
Documentation updated
CI is green

Reviewer notes

subprocess.Popen (fork+exec) gives each shard a clean Python interpreter and its own valid EGL/osmesa context — no stale GPU handles inherited from the parent.
Seeds are non-overlapping: shard K starts at seed + K * ceil(n_episodes / num_shards), so the combined run is equivalent to one serial run with the same seeds.
--render-device auto: uses EGL (GPU) for 1 shard; switches to osmesa (CPU rendering, 0 VRAM) when multiple model copies would exhaust VRAM.
Anyone in the community is free to review the PR.

Add lerobot-eval-parallel and lerobot-eval-autotune entry points for multi-process evaluation. A single H100 running 4 shards of SmolVLA achieves ~100% GPU utilisation vs ~0.5% with the serial baseline. - EvalConfig: add shard_id / num_shards fields; validate ranges - lerobot_eval.py: _shard_episodes() splits n_episodes round-robin; eval_main uses per-shard n_episodes + seed offset; writes shard_K_of_N.json when num_shards > 1 - lerobot_eval_parallel.py: spawns K subprocesses with disjoint shard IDs, sets MUJOCO_GL and OMP_NUM_THREADS, merges results on completion - lerobot_eval_autotune.py: probes GPU VRAM, CPU cores, optional model footprint and env step time; derives optimal num_shards / batch_size / MUJOCO_GL; prints a paste-ready command - pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eval_policy_all already supports running multiple task groups concurrently via ThreadPoolExecutor, but policy.reset() was not thread-safe: all threads shared the same policy object and its mutable state (action queues, temporal buffers). Fix: each thread receives a shallow copy of the policy. copy.copy() creates a new Python object whose _parameters dict is a shared reference — same tensor storage, zero extra VRAM — while reset() rebinds per-episode state to fresh objects per thread. Caveat: ACT with temporal_ensemble_coeff is not safe with this approach (its reset() mutates a shared sub-object). Keep max_parallel_tasks=1 for that config. For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises GPU utilization from ~20% to ~60-80% with no additional VRAM cost. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pkooij mentioned this pull request Apr 3, 2026

feat(eval): thread-safe policy copies for max_parallel_tasks > 1 #3276

Closed

4 tasks

pkooij force-pushed the feat/async-vector-env branch from b43f9ab to 1f7e7b4 Compare April 7, 2026 10:30

pkooij and others added 2 commits April 7, 2026 13:43

pkooij force-pushed the feat/eval-parallel branch from b411838 to 66276f1 Compare April 7, 2026 11:44

pkooij closed this Apr 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): episode sharding, parallel launcher, and autotune#3275

feat(eval): episode sharding, parallel launcher, and autotune#3275
pkooij wants to merge 2 commits intofeat/async-vector-envfrom
feat/eval-parallel

pkooij commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pkooij commented Apr 3, 2026

Title

Type / Scope

Summary / Motivation

Related issues

What changed

How was this tested (or how to run locally)

Checklist (required before merge)

Reviewer notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant